🐛 Add checks to catch race-driven panics in metrics endpoint e2e #1795

Merged

Conversation

bentito
Contributor

@bentito bentito commented Feb 19, 2025

Description

There’s potential for flakiness due to timing issues. For example, the TestOperatorControllerMetricsExportedEndpoint assumes that:
• The ClusterRoleBinding and token are immediately effective.
• The curl pod becomes ready promptly.
• The metrics endpoint is available as soon as the pod is ready.

In slower or variable environments, these assumptions might not hold, leading to intermittent failures. Adding explicit waits or retry logic for each resource’s readiness could help mitigate race conditions.

To fix this, this PR:

• Adds a generic retry helper that attempts a function up to 5 times with a 5-second delay between tries (a rough sketch follows this list).
• Wraps key steps in the metrics endpoint tests with the retry logic:
  • Creating the ClusterRoleBinding.
  • Generating the ServiceAccount token.
  • Creating the curl pod.
  • Validating the metrics endpoint via curl.
• Refactors these steps to return errors instead of calling require.NoError immediately, so transient issues are retried before the test ultimately fails.
• Together, these changes minimize test flakiness due to race conditions or temporary resource unavailability.
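
For illustration, a minimal sketch of what such a generic retry helper might look like (the exact helper in this PR may differ; the package and helper names here are assumptions):

package e2e

import (
	"time"
)

// retry attempts fn up to the given number of times, sleeping delay between
// tries, and returns the last error if every attempt fails.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return lastErr
}

A step such as creating the curl pod can then be wrapped as retry(5, 5*time.Second, func() error { return createCurlPod(ctx) }), where createCurlPod is a hypothetical helper that returns an error rather than failing the test immediately.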

Currently on main, on my local machine, I ran this script:

#!/bin/bash

pass_count=0
total=10

for i in $(seq 1 $total); do
    echo "Running test iteration $i/$total..."
    if make test-e2e > /dev/null 2>&1; then
        echo "Iteration $i: PASS"
        ((pass_count++))
    else
        echo "Iteration $i: FAIL"
    fi
done

echo "Summary: Passed $pass_count/$total times."

Failed on the 2nd try on main:

$ ./scripts/test_e2e_10_times.sh
Running test iteration 1/10...
Iteration 1: PASS
Running test iteration 2/10...
Iteration 2: FAIL
Running test iteration 3/10...
^C

I'm running the script now on this branch and will update with my findings.
I need to make the script report more detail, to confirm it's this e2e test that's actually causing the failure, but it looks like no joy :-(

Running the script on this branch:

Running test iteration 1/10...
Iteration 1: PASS
Running test iteration 2/10...
Iteration 2: PASS
Running test iteration 3/10...
Iteration 3: FAIL
Running test iteration 4/10...
Iteration 4: PASS
Running test iteration 5/10...
Iteration 5: FAIL
Running test iteration 6/10...

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@bentito bentito requested a review from a team as a code owner February 19, 2025 21:25

netlify bot commented Feb 19, 2025

Deploy Preview for olmv1 ready!

🔨 Latest commit: 493315f
🔍 Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/67be3b90876b3a0008825ec5
😎 Deploy Preview: https://deploy-preview-1795--olmv1.netlify.app

@bentito
Contributor Author

bentito commented Feb 19, 2025

Tries to fix #1781


codecov bot commented Feb 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.43%. Comparing base (7040ee2) to head (493315f).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1795      +/-   ##
==========================================
+ Coverage   68.34%   68.43%   +0.09%     
==========================================
  Files          63       62       -1     
  Lines        5117     5117              
==========================================
+ Hits         3497     3502       +5     
+ Misses       1390     1387       -3     
+ Partials      230      228       -2     
Flag Coverage Δ
e2e 51.78% <ø> (+0.15%) ⬆️
unit 55.89% <ø> (-0.02%) ⬇️


@joelanford
Member

The ClusterRoleBinding and token are immediately effective.
The curl pod becomes ready promptly.
The metrics endpoint is available as soon as the pod is ready.

Are there examples in the flakes for each of these assumptions not holding? If so, I think we should look at each one to understand what's happening and then decide what the correct fix is.

@bentito bentito changed the title from "🐛 Add retries and checks to catch race-driven panics" to "🐛 Add checks to catch race-driven panics in metrics endpoint e2e" on Feb 20, 2025
Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @bentito,

Terrific work! 🚀 This is looking great.

Just a couple of minor suggestions:

  • The PR description and commit history seem to be a bit out of sync with the latest changes. It would be great to update the description accordingly and squash the commits to maintain a clean history.
  • The 30s timeout might be a bit short, especially considering that the --force option is used here: #1795 (comment). Since the standard timeout across tests is 1 minute, it might be better to align with that for consistency.

Other than that, everything looks good to me! 👍

Thanks for your work on this. 🎉

Comment on lines +172 to +184
// Wait for the ClusterRoleBinding to be deleted.
if err := waitForDeletion(ctx, c.client, "clusterrolebinding", c.clusterBinding); err != nil {
	c.t.Logf("Error waiting for clusterrolebinding deletion: %v", err)
} else {
	c.t.Log("ClusterRoleBinding deleted")
}

// Wait for the Pod to be deleted.
if err := waitForDeletion(ctx, c.client, "pod", c.curlPodName, "-n", c.namespace); err != nil {
	c.t.Logf("Error waiting for pod deletion: %v", err)
} else {
	c.t.Log("Pod deleted")
}
Member

Do we really need to even wait for these to be deleted?

Contributor

+1, good point.
But I am betting that this is the fix: in the scenarios where it fails because the Pod gets stuck, we are now just logging instead of failing.

Contributor Author

@bentito bentito Feb 25, 2025

To be clear: I have a pretty quick test (takes about ~50m) -- run e2e 10 times. This test always fails on my local box when run on main, at least one fail out of 10, usually 3-5 fails. With the changes on this PR, that same test passes. I think @camilamacedo86 is right, the change in behavior is key to the fix: we log instead of fail. I guess I could try not waiting, but I'd rather move on since something is fixing the flake.
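
For reference, the waitForDeletion helper quoted above is not shown in this excerpt. Below is a rough sketch of what such a helper could do, assuming it delegates to kubectl wait --for=delete; the real helper's signature and client usage in the PR may differ:

package e2e

import (
	"context"
	"fmt"
	"os/exec"
)

// waitForDeletionSketch blocks until the named resource is gone or the
// timeout expires, by delegating to `kubectl wait --for=delete`.
// extraArgs can carry additional flags such as "-n", "<namespace>".
func waitForDeletionSketch(ctx context.Context, kind, name string, extraArgs ...string) error {
	args := append([]string{"wait", "--for=delete", fmt.Sprintf("%s/%s", kind, name), "--timeout=60s"}, extraArgs...)
	out, err := exec.CommandContext(ctx, "kubectl", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("waiting for %s/%s deletion: %w (output: %s)", kind, name, err, string(out))
	}
	return nil
}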

@tmshort
Contributor

tmshort commented Feb 25, 2025

There’s potential for flakiness due to timing issues. For example, the test assumes that:
• The ClusterRoleBinding and token are immediately effective.
• The curl pod becomes ready promptly.
• The metrics endpoint is available as soon as the pod is ready.

Seems like these all can be covered by retry logic in the validate method.

Signed-off-by: Brett Tofel <[email protected]>
@bentito bentito force-pushed the add-retry-validate-metrics-e2e branch from 1e0ac4c to 493315f on February 25, 2025 21:52
Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 26, 2025
@LalatenduMohanty
Member

/hold for @joelanford @camilamacedo86 or @azych to review

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 26, 2025
Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @bentito,

This might not fully address the root cause (why was it not possible to delete?). The fix mainly works because we are logging instead of failing, as discussed in this comment.

That said, I don’t see much value in spending more effort on it at this point. You did a great job addressing the intermittent scenario, and we can always revisit this if needed in the future. 🥇 I don’t see why we shouldn't move forward with this approach. If further tests show that the Pod/ClusterRoleBinding isn’t being cleaned up properly for some reason other than a timeout, or if we hit intermittent issues again as we add more tests, then we might need to look further. For now, it looks great to me! 👍

/lgtm
/approve

@azych
Contributor

azych commented Feb 26, 2025

I share the sentiment - I'm still unsure about the root cause, but if this works and other people are good with it, I'm OK too.
We can always revisit.

/lgtm

@azych azych removed their assignment Feb 26, 2025
@bentito bentito added this pull request to the merge queue Feb 26, 2025
Merged via the queue into operator-framework:main with commit 1573846 Feb 26, 2025
19 checks passed